recursive training
A theoretical basis for model collapse in recursive training
Our analysis will draw heavily upon the three topics in probability theory mentioned above. We briefly summarize the relevant results here; they can be found in [3] (see also [12] for a more extensive treatment), [11], and [2] (see also [9] for a more extensive treatment), respectively. A. Convergence of probability measures: Let S be a Polish space, i.e., a separable topological space whose topology is compatible with a complete metric. Let B denote its Borel σ-field, i.e., the smallest σ-field containing its open sets.
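For reference, weak convergence of probability measures on such a space admits the standard portmanteau characterization below (our restatement of a textbook fact, not a quotation from [3]):

```latex
% Weak convergence of probability measures \mu_n, \mu on a Polish space S
% with Borel sigma-field B: convergence of expectations of bounded
% continuous test functions (standard statement, not quoted from [3]).
\[
  \mu_n \Rightarrow \mu
  \quad \Longleftrightarrow \quad
  \int_S f \,\mathrm{d}\mu_n \;\longrightarrow\; \int_S f \,\mathrm{d}\mu
  \quad \text{for every bounded continuous } f : S \to \mathbb{R}.
\]
```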
Knowledge Collapse in LLMs: When Fluency Survives but Facts Fail under Recursive Synthetic Training
Keisha, Figarri, Wu, Zekun, Wang, Ze, Koshiyama, Adriano, Treleaven, Philip
Large language models increasingly rely on synthetic data due to the scarcity of human-written content, yet recursive training on model-generated outputs leads to model collapse, a degenerative process that threatens factual reliability. We define knowledge collapse as a distinct three-stage phenomenon where factual accuracy deteriorates while surface fluency persists, creating "confidently wrong" outputs that pose critical risks in accuracy-dependent domains. Through controlled experiments with recursive synthetic training, we demonstrate that collapse trajectory and timing depend critically on instruction format, distinguishing instruction-following collapse from traditional model collapse through its conditional, prompt-dependent nature. We propose domain-specific synthetic training as a targeted mitigation strategy that achieves substantial improvements in collapse resistance while maintaining computational efficiency. Our evaluation framework combines model-centric indicators with task-centric metrics to detect distinct degradation phases, enabling reproducible assessment of epistemic deterioration across different language models. These findings provide both theoretical insights into collapse dynamics and practical guidance for sustainable AI training in knowledge-intensive applications where accuracy is paramount.
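The "confidently wrong" regime is, in effect, a divergence between a fluency signal and a factual-accuracy signal tracked across training generations. The sketch below is our illustration only; the metric series, thresholds, and stage labels are invented placeholders, not the paper's evaluation framework:

```python
# Hypothetical sketch of generation-wise monitoring: track a fluency proxy
# and a factual-accuracy metric per generation, and flag the regime where
# fluency holds but facts fail. All numbers and thresholds are invented.

def classify_stage(fluency, accuracy, flu_floor=0.8, acc_floor=0.8):
    if fluency >= flu_floor and accuracy >= acc_floor:
        return "healthy"
    if fluency >= flu_floor and accuracy < acc_floor:
        return "confidently wrong"   # fluent surface form, failing facts
    return "full collapse"           # surface fluency has degraded too

# Toy per-generation scores (purely illustrative):
fluency_by_gen  = [0.95, 0.94, 0.93, 0.92, 0.90, 0.70]
accuracy_by_gen = [0.92, 0.88, 0.75, 0.55, 0.40, 0.30]

for gen, (f, a) in enumerate(zip(fluency_by_gen, accuracy_by_gen)):
    print(f"gen {gen}: fluency={f:.2f} accuracy={a:.2f} -> {classify_stage(f, a)}")
```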
Machine-generated text detection prevents language model collapse
Drayson, George, Lampos, Vasileios
As Large Language Models (LLMs) become increasingly prevalent, their generated outputs are proliferating across the web, risking a future where machine-generated content dilutes human-authored text. Since online data is the primary resource for LLM pre-training, subsequent models could be trained on an unknown portion of synthetic samples. This will lead to model collapse, a degenerative process whereby LLMs reinforce their own errors and ultimately yield declining performance. In this study, we investigate the impact of decoding strategy on model collapse, analysing the characteristics of text at each model generation, the similarity to human references, and the resulting model performance. Using the decoding strategies that lead to the most significant degradation, we evaluate model collapse in more realistic scenarios where the origin of the data (human or synthetic) is unknown. We train a machine-generated text detector and propose an importance sampling approach to alleviate model collapse. Our method is validated on two LLM variants (GPT-2 and SmolLM2) on the open-ended text generation task. We demonstrate that it can not only prevent model collapse but also improve performance when sufficient human-authored samples are present.
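Our reading of the proposed mitigation, in sketch form: score each candidate training document with a machine-generated-text detector, then resample the corpus with weights tied to the detector's probability that the text is human-authored. The function and variable names below are illustrative, not the authors' code:

```python
# Minimal sketch of detector-guided importance sampling over a training
# corpus. `p_human` stands in for detector outputs; resampling favours
# documents the detector judges to be human-written.
import numpy as np

rng = np.random.default_rng(0)

def resample_corpus(docs, p_human, n_samples):
    """Importance-sample `docs` in proportion to the detector's
    probability that each document is human-authored."""
    weights = np.asarray(p_human, dtype=float)
    weights /= weights.sum()
    idx = rng.choice(len(docs), size=n_samples, replace=True, p=weights)
    return [docs[i] for i in idx]

# Toy usage with made-up detector scores:
docs = ["human-like text A", "synthetic-looking text B", "human-like text C"]
p_human = [0.9, 0.2, 0.8]   # hypothetical detector outputs
print(resample_corpus(docs, p_human, n_samples=5))
```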
How Bad is Training on Synthetic Data? A Statistical Analysis of Language Model Collapse
Seddik, Mohamed El Amine, Chen, Suei-Wen, Hayou, Soufiane, Youssef, Pierre, Debbah, Merouane
The phenomenon of model collapse, introduced in (Shumailov et al., 2023), refers to the deterioration in performance that occurs when new models are trained on synthetic data generated from previously trained models. This recursive training loop makes the tails of the original distribution disappear, thereby making future-generation models forget about the initial (real) distribution. With the aim of rigorously understanding model collapse in language models, we consider in this paper a statistical model that allows us to characterize the impact of various recursive training scenarios. Specifically, we demonstrate that model collapse cannot be avoided when training solely on synthetic data. However, when mixing both real and synthetic data, we provide an estimate of a maximal amount of synthetic data below which model collapse can eventually be avoided. Our theoretical conclusions are further supported by empirical validations.
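As a toy illustration of the mixing result (not the paper's statistical model, which is developed for language models), the simulation below recursively refits a categorical distribution while varying the synthetic fraction of each generation's corpus; the count of surviving categories is a crude proxy for tail loss:

```python
# Recursive re-estimation of a long-tailed categorical distribution.
# With a purely synthetic corpus (alpha=1.0) the support shrinks every
# generation; mixing in real samples keeps re-injecting tail mass.
import numpy as np

rng = np.random.default_rng(0)
K = 1000
true_dist = rng.dirichlet(np.full(K, 0.5))   # long-tailed "real" distribution

def run(alpha, generations=20, n=20_000):
    """Each generation trains on a corpus with fraction `alpha` synthetic data."""
    dist = true_dist
    for _ in range(generations):
        n_syn = int(alpha * n)
        synthetic = rng.choice(K, size=n_syn, p=dist)        # from previous model
        real = rng.choice(K, size=n - n_syn, p=true_dist)    # fresh real samples
        counts = np.bincount(np.concatenate([synthetic, real]), minlength=K)
        dist = counts / counts.sum()                         # refit the "model"
    return (dist > 0).sum()

for alpha in (1.0, 0.9, 0.5):
    print(f"synthetic fraction {alpha}: categories retained = {run(alpha)} / {K}")
```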
Recursive Training of 2D-3D Convolutional Networks for Neuronal Boundary Prediction
Lee, Kisuk, Zlateski, Aleksandar, Vishwanathan, Ashwin, Seung, H. Sebastian
Efforts to automate the reconstruction of neural circuits from 3D electron microscopic (EM) brain images are critical for the field of connectomics. An important computation for reconstruction is the detection of neuronal boundaries. Images acquired by serial section EM, a leading 3D EM technique, are highly anisotropic, with inferior quality along the third dimension. For such images, the 2D max-pooling convolutional network has set the standard for performance at boundary detection. Here we achieve a substantial gain in accuracy through three innovations.
Learning to Self-Train for Semi-Supervised Few-Shot Classification
Sun, Qianru, Li, Xinzhe, Liu, Yaoyao, Zheng, Shibao, Chua, Tat-Seng, Schiele, Bernt
Few-shot classification (FSC) is challenging due to the scarcity of labeled training data (e.g. only one labeled data point per class). Meta-learning has been shown to achieve promising results by learning to initialize a classification model for FSC. In this paper we propose a novel semi-supervised meta-learning method called learning to self-train (LST) that leverages unlabeled data and specifically meta-learns how to cherry-pick and label such unlabeled data to further improve performance. To this end, we train the LST model through a large number of semi-supervised few-shot tasks. On each task, we train a few-shot model to predict pseudo labels for unlabeled data, and then iterate the self-training steps on labeled and pseudo-labeled data, with each step followed by fine-tuning. We additionally learn a soft weighting network (SWN) to optimize the self-training weights of pseudo labels so that better ones can contribute more to gradient descent optimization. We evaluate our LST method on two ImageNet benchmarks for semi-supervised few-shot classification and achieve large improvements over the state of the art.
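A rough sketch of the self-training loop just described, with the learned soft weighting network replaced by a simple confidence-based weight (a stand-in we chose for illustration; LST meta-learns these weights, and the toy data and model here are ours, not the paper's):

```python
# Self-training on a toy binary task: fit on labeled data, pseudo-label the
# unlabeled pool, weight pseudo-labels by confidence, and fine-tune; repeat.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Few labeled points, many unlabeled ones; class 1 is shifted along x.
X_lab = rng.normal(size=(10, 2)) + np.array([[2, 0]]) * (np.arange(10) % 2)[:, None]
y_lab = (np.arange(10) % 2).astype(float)
X_unl = rng.normal(size=(200, 2)) + np.array([[2, 0]]) * rng.integers(0, 2, 200)[:, None]

w = np.zeros(2)                            # logistic-regression weights
for step in range(5):                      # outer self-training iterations
    p_unl = sigmoid(X_unl @ w)
    pseudo = (p_unl > 0.5).astype(float)   # hard pseudo-labels
    conf = np.abs(p_unl - 0.5) * 2         # soft weights in [0, 1] (SWN stand-in)
    X = np.vstack([X_lab, X_unl])
    y = np.concatenate([y_lab, pseudo])
    sw = np.concatenate([np.ones(len(y_lab)), conf])
    for _ in range(100):                   # inner fine-tuning steps
        grad = X.T @ (sw * (sigmoid(X @ w) - y)) / sw.sum()
        w -= 0.5 * grad
    acc = ((sigmoid(X_lab @ w) > 0.5) == y_lab).mean()
    print(f"iter {step}: labeled accuracy = {acc:.2f}")
```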